Similarity Ranking as Attribute for Machine Learning Approach to Authorship Identification
Abstract
In the authorship identification task, short writing samples from N authors are given together with an anonymous document written by one of those N authors. The task is to determine who wrote the anonymous text. Practically all approaches solve this problem with machine learning methods. The input attributes for the machine learning process are usually formed from stylistic or grammatical properties of individual documents, or from a defined similarity between a document and an author. In this paper, we present the results of an experiment that extends the machine learning attributes with a ranking of the similarity between a document and an author: we transform the similarity between an unknown document and one of the N authors into that author's rank when all N authors are ordered by their similarity to the document. Similarity probability and similarity ranking were compared using the Support Vector Machines algorithm. The results show that machine learning methods perform slightly better with attributes based on the similarity ranking than with the previously used similarity between an author and a document.
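The similarity-to-rank transformation described in the abstract can be illustrated with a minimal sketch. The function name `similarity_to_rank`, the example scores, and the tie-breaking rule (earlier authors win ties) are assumptions made for illustration; the abstract does not specify how similarities are computed or how ties are handled.

```python
def similarity_to_rank(similarities):
    """Map each author's similarity score to its rank (1 = most similar).

    `similarities[i]` is an assumed, precomputed similarity between the
    unknown document and author i. Ties keep input order (an assumption).
    """
    # Author indices ordered from most to least similar; sorted() is stable,
    # so equal scores stay in their original order.
    order = sorted(range(len(similarities)), key=lambda i: -similarities[i])
    ranks = [0] * len(similarities)
    for position, author_index in enumerate(order, start=1):
        ranks[author_index] = position
    return ranks


# Hypothetical similarities of one unknown document to N = 4 authors.
scores = [0.62, 0.91, 0.15, 0.70]
print(similarity_to_rank(scores))
```

Under this sketch, the rank vector would replace (or complement) the raw similarity values as input attributes to the classifier, which makes the attributes insensitive to the absolute scale of the similarity measure.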
Similar resources
A Framework for Authorship Identification in the Internet Environment
Misuse of anonymous online communication for illegal purposes has become a major concern [2,12]. In this paper, we present a framework named ART (Authorship Recognition Tool), that is designed to minimize manual procedures and maximize the efficiency of authorship identification based on the content of Internet electronic documents. The framework covers the phases of document retrieval and data...
Unsupervised Method for the Authorship Identification Task
This paper presents an approach for tackling the authorship identification task. The approach is based on comparing the similarity between a given unknown document against the known documents using a number of different phrase-level and lexical-syntactic features, so that an unknown document can be classified as having been written by the same author, if the different similarity measures obtain...
A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages
With the popularity of Internet technologies and applications, inappropriate or illegal online messages have become a problem for the society. The goal of authorship attribution for anonymous online messages is to identify the authorship from a group of potential suspects for investigation identification. Most previous contributions focused on extracting various writing-style features and emplo...
Comparing techniques for authorship attribution of source code
Attributing authorship of documents with unknown creators has been studied extensively for natural language text such as essays and literature, but less so for non-natural languages such as computer source code. Previous attempts at attributing authorship of source code can be categorised by two attributes: the software features used for the classification, either strings of n tokens/bytes (n-g...
Designing a Combined-fuzzy Methodology to Improve Organizational Diagnosis Process Effectiveness through Identification and Assessment of Effective Parameters
Organizational diagnosis is a systematic and scientific method to identify, categorize and single out the obstacles and their impact on organizational performance through interaction between internal and external views and preparation and setting up operational plans to solve them in the organization. Providing standard products and emphasizing on the financial measures do not guarantee the sur...